About the project

I am excited about this project because I think I will learn some new R skills.

My GitHub repository is at https://github.com/pyrysipila/IODS-project


Chapter 2: Linear regression

Describe the work you have done this week and summarize your learning.

I have worked on data wrangling and linear regression. I think I have learned a lot. Please see below.

I will start by reading in the learning2014 data from my own data folder.

learning2014 <- read.csv("//atkk/home/p/pyrysipi/Documents/Tutkimus/Kurssit/Introduction to Open Data Science 2018/Data/learning2014.csv")

The data were collected from students at the University of Jyväskylä in 2014-2015. The aim was to assess how attitude and different learning approaches predict success on an introductory statistics course.

More information on the data can be found here: http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt

Below I will explore the structure and dimensions of the data.

#dimensions of the data:
dim(learning2014)
## [1] 166   7
#the data have 166 observations and 7 variables

#structure of the data
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
#First six rows of the data:
head(learning2014)
##   gender Age attitude     deep  stra     surf Points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

Next, I will describe the data with plots and summaries of the variables.

# A scatter plot matrix
pairs(learning2014[-1], col = learning2014$gender)
library(ggplot2)

library(GGally)
p1 <- ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
p1

summary(learning2014)
##  gender       Age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

From the plots and summaries above, it can be seen that the data consist of 110 women and 56 men. The mean age is 25.5 years. The age distribution is right-skewed: some students are much older than average, but none are much younger. The other variables have distributions that at least resemble a normal distribution.

Attitude has a moderate correlation with Points (r ~ 0.4): a better attitude is associated with better success on the course. Deep (deep approach) has a moderate negative correlation (r ~ -0.3) with surf (surface approach). This supports the validity of these constructs, because one would expect the deep and surface approaches to be antagonistic. Correlations between the other variables are low (|r| < 0.2).
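The correlations quoted above can also be checked numerically (a quick sketch, assuming learning2014 is loaded as above):

```r
# Correlation matrix of the numeric variables (gender, column 1, excluded)
round(cor(learning2014[, -1]), 2)
```

The attitude-Points and deep-surf entries of this matrix should match the values read off the ggpairs plot.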

Next, I will fit a linear regression model with points as the dependent variable.

lm1 <- lm(Points ~ attitude + deep + stra, data = learning2014)
summary(lm1)
## 
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

The intercept is statistically significant at the alpha level of 0.05 (p < 0.05). Attitude is a statistically significant predictor of Points, but deep and stra are not, so I will drop deep and stra from the model.

lm2 <- lm(Points ~ attitude, data = learning2014)
summary(lm2)
## 
## Call:
## lm(formula = Points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

For every one-point increase in attitude, the student gains on average ~3.5 points. R-squared is ~0.19, which means that the model explains 19% of the variation in Points. Because attitude is the only variable in the model, it alone accounts for that 19%.
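The slope can be illustrated by predicting Points for two hypothetical students whose attitude scores differ by one unit (a sketch using the fitted lm2 from above; the students and their attitude values are made up for illustration):

```r
# Two hypothetical students with attitude scores one unit apart
new_students <- data.frame(attitude = c(3, 4))
predict(lm2, newdata = new_students)
# equivalently by hand with the fitted coefficients:
# 11.6372 + 3.5255 * c(3, 4)
```

The two predictions should differ by exactly the attitude coefficient, ~3.5 points.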

Next, I will draw the diagnostic plots of the linear model I have just constructed (lm2).

plot(lm2, which = c(1,2,5))

The residuals vs fitted plot does not show any obvious patterns. This indicates that the model fits the data reasonably well and the data are homoscedastic.

In the normal Q-Q plot the dots are reasonably close to the diagonal line. This means that the distribution of the residuals is reasonably close to a normal distribution. Therefore, the linear regression assumption of normally distributed residuals is not violated.

The residuals vs leverage plot does not identify any points with a high Cook’s distance (no points fall outside the dashed red line indicating a Cook’s distance of 0.5). This means that there are no outliers that could distort the regression model.
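The same check can be done numerically instead of visually (a sketch using lm2 from above):

```r
# Numeric counterpart of the residuals vs leverage plot
cd <- cooks.distance(lm2)
max(cd)         # largest Cook's distance in the data
any(cd > 0.5)   # FALSE when no observation crosses the 0.5 threshold
```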


Chapter 3: Logistic regression

Import the data from my computer

alc <- read.csv("//atkk/home/p/pyrysipi/Documents/Tutkimus/Kurssit/Introduction to Open Data Science 2018/Data/alc.csv")
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

The data

The data were collected from secondary school students in Portugal. They contain information on alcohol use and performance in mathematics and Portuguese, along with background information on the students.

More information on the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

I will examine the relationships of sex, Medu (Mother’s education), Fedu (Father’s education) and famsize (family size) with high alcohol use.

My hypotheses are: 1) men have a higher probability of high alcohol use than women, 2) higher maternal education is associated with a lower probability of high alcohol use, 3) higher paternal education is associated with a lower probability of high alcohol use, and 4) larger family size is associated with a lower probability of high alcohol use.

Cross tabulations

table(sex = alc$sex, high_use = alc$high_use) %>% addmargins()
##      high_use
## sex   FALSE TRUE Sum
##   F     156   42 198
##   M     112   72 184
##   Sum   268  114 382
table(Medu = alc$Medu, high_use = alc$high_use) %>% addmargins()
##      high_use
## Medu  FALSE TRUE Sum
##   0       1    2   3
##   1      33   18  51
##   2      80   18  98
##   3      59   36  95
##   4      95   40 135
##   Sum   268  114 382
table(Fedu = alc$Fedu, high_use = alc$high_use) %>% addmargins()
##      high_use
## Fedu  FALSE TRUE Sum
##   0       2    0   2
##   1      53   24  77
##   2      75   30 105
##   3      72   27  99
##   4      66   33  99
##   Sum   268  114 382
table(Family_size = alc$famsize, high_use = alc$high_use) %>% addmargins()
##            high_use
## Family_size FALSE TRUE Sum
##         GT3   201   77 278
##         LE3    67   37 104
##         Sum   268  114 382

Plots

library(ggplot2)
g_sex <- ggplot(alc, aes(high_use, ..count..))
g_sex + geom_bar(aes(fill=sex), position = "dodge")

g_Medu <- ggplot(alc, aes(x = high_use, y = Medu))
g_Medu + geom_boxplot()

g_Fedu <- ggplot(alc, aes(x = high_use, y = Fedu))
g_Fedu + geom_boxplot()

g_famsize <- ggplot(alc, aes(x = high_use, ..count..))
g_famsize + geom_bar(aes(fill=famsize), position = "dodge")

Men seem to be high users of alcohol more often than women, and high use seems to be more common in small families (three or fewer members) than in large families (more than three members). Thus, my hypotheses 1 and 4 seem to be correct.

In contrast, maternal education does not seem to be associated with high alcohol use, and paternal education seems to be associated with a higher probability of high alcohol use. Thus, my hypotheses 2 and 3 seem to be false.

Logistic regression

m <- glm(high_use ~ sex + Medu + Fedu + famsize, data=alc, family = "binomial")
m
## 
## Call:  glm(formula = high_use ~ sex + Medu + Fedu + famsize, family = "binomial", 
##     data = alc)
## 
## Coefficients:
## (Intercept)         sexM         Medu         Fedu   famsizeLE3  
##    -1.40340      0.85627     -0.06956      0.08132      0.29581  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  377 Residual
## Null Deviance:       465.7 
## Residual Deviance: 449.2     AIC: 459.2

Male sex, higher father’s education (per one-unit increase) and small family size (three or fewer members) have positive coefficients, indicating positive associations with high alcohol use.

Higher mother’s education (per one-unit increase) has a negative coefficient, indicating a negative association with high alcohol use.

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.2457589 0.1194573 0.4895376
## sexM        2.3543727 1.4978628 3.7375354
## Medu        0.9328005 0.7093940 1.2255525
## Fedu        1.0847204 0.8293229 1.4235393
## famsizeLE3  1.3442147 0.8179982 2.1900766

Men have 2.4 times the odds of women of being high alcohol users. The association is statistically significant (at alpha level 0.05), because the 95% confidence interval does not include 1. My hypothesis 1 seems to be correct.
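As a sanity check, the unadjusted odds ratio for sex can be computed by hand from the cross tabulation above (a sketch; the model's OR of 2.35 is adjusted for the other covariates, so the two need not match exactly):

```r
# Odds of high use for men (72/112) over odds for women (42/156),
# counts taken from the sex x high_use table
(72 / 112) / (42 / 156)
# roughly 2.39, in line with the adjusted OR of 2.35
```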

Each one-unit increase in mother’s education multiplies the odds of being a high alcohol user by 0.93, i.e. lowers them slightly. Nevertheless, the association is not statistically significant (at alpha level 0.05), because the 95% confidence interval includes 1. My hypothesis 2 cannot be confirmed.

Each one-unit increase in father’s education multiplies the odds of being a high alcohol user by 1.08. Nevertheless, the association is not statistically significant (at alpha level 0.05), because the 95% confidence interval includes 1. My hypothesis 3 cannot be confirmed.

Those from small families have 1.3 times the odds of those from large families of being high alcohol users. Nevertheless, the association is not statistically significant (at alpha level 0.05), because the 95% confidence interval includes 1. My hypothesis 4 cannot be confirmed.

Predictive power of the model

Only sex will be included in the predictive model because it was the only variable with a statistically significant association with high alcohol use.

# fit the model
m <- glm(high_use ~ sex, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, high_use, probability, prediction) %>% tail(10)
##     sex high_use probability prediction
## 373   M    FALSE   0.3913043      FALSE
## 374   M     TRUE   0.3913043      FALSE
## 375   F    FALSE   0.2121212      FALSE
## 376   F    FALSE   0.2121212      FALSE
## 377   F    FALSE   0.2121212      FALSE
## 378   F    FALSE   0.2121212      FALSE
## 379   F    FALSE   0.2121212      FALSE
## 380   F    FALSE   0.2121212      FALSE
## 381   M     TRUE   0.3913043      FALSE
## 382   M     TRUE   0.3913043      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE
##    FALSE   268
##    TRUE    114
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use     FALSE
##    FALSE 0.7015707
##    TRUE  0.2984293
#plot the predictions and actual values

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2984293

The model predicts that no one is a high alcohol user. This is wrong in 30% of the cases. Still, the model performs better than flipping a coin, a strategy that would be wrong in 50% of the cases.
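Because both predicted probabilities (0.39 for men, 0.21 for women) fall below the 0.5 cutoff, every prediction is FALSE, so the training error reduces to the share of high users. A quick arithmetic check from the cross-tabulation counts:

```r
# All predictions are FALSE, so the error rate equals the proportion of
# TRUE cases: 114 high users out of 382 students
114 / 382
# the loss function above returns the same value, 0.2984293
```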

Bonus:

10-fold cross validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2984293

The model has a prediction error of 0.30. I’ll try to build a better one by including the same variables as in the DataCamp exercise plus a couple of extra variables:

# fit the model
m <- glm(high_use ~ school + sex + age + address + famsize + failures + absences, data = alc, family = "binomial")
m
## 
## Call:  glm(formula = high_use ~ school + sex + age + address + famsize + 
##     failures + absences, family = "binomial", data = alc)
## 
## Coefficients:
## (Intercept)     schoolMS         sexM          age     addressU  
##    -3.18210      0.19912      0.92655      0.09062     -0.40680  
##  famsizeLE3     failures     absences  
##     0.30259      0.42108      0.09196  
## 
## Degrees of Freedom: 381 Total (i.e. Null);  374 Residual
## Null Deviance:       465.7 
## Residual Deviance: 419   AIC: 435
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
summary(alc$probability)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09712 0.17396 0.26337 0.29843 0.38262 0.93004
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     79   35
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.65968586 0.04188482
##    TRUE  0.20680628 0.09162304
#plot the predictions and actual values

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486911
#perform 10-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2643979

Now the cross-validated prediction error is 0.26. It seems that I cannot beat DataCamp.


Chapter 4: Clustering and classification

Analysis exercises

1

A new markdown file has been created.

2

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Boston is a classic dataset describing living conditions such as urban environment, location, crime rate, air quality and economic issues in towns in the suburbs of Boston. It has 506 observations and 14 variables. More information on the data can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

3

#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.2.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
#histogram
Boston %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#scatter plot of all variables in the dataset
plot(Boston)

# calculate the correlation matrix,round it and print it
cor_matrix<-cor(Boston)
cor_matrix <- cor_matrix %>% round(digits=2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos ="d",tl.cex=0.6)

Many variables have skewed distributions. The variable rm (average number of rooms per dwelling) has a distribution close to normal.

There are quite a few strong correlations. The strongest positive correlation is between tax (full-value property-tax rate per $10,000) and rad (index of accessibility to radial highways) (0.91) and the strongest negative correlation is between dis (weighted mean of distances to five Boston employment centres) and nox (nitrogen oxides concentration (parts per 10 million)) (-0.77).
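The extreme correlations can also be located programmatically instead of scanning the matrix by eye (a sketch, assuming the Boston data are loaded as above):

```r
# Locate the strongest off-diagonal correlations in the Boston data
cm <- cor(Boston)
diag(cm) <- 0                          # drop the trivial 1s on the diagonal
which(cm == max(cm), arr.ind = TRUE)   # strongest positive pair (rad, tax)
which(cm == min(cm), arr.ind = TRUE)   # strongest negative pair (nox, dis)
```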

4

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical variable to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(Boston)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

All the scaled variables have mean zero (and standard deviation 1).
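Under the hood, scale() standardizes each column to (x - mean(x)) / sd(x); a manual check on one column of the original data (a sketch, assuming Boston is still available):

```r
# Verify scale() on the crim column by standardizing it by hand
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(scale(Boston)[, "crim"]), manual)  # TRUE
```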

5

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2425743 0.2500000 0.2623762 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       1.0966152 -0.9293940 -0.15302300 -0.9085522  0.39563762
## med_low  -0.1201594 -0.3670518 -0.07145661 -0.5953865 -0.13481912
## med_high -0.3836558  0.1629760  0.23442641  0.4041304  0.06831498
## high     -0.4872402  1.0170298 -0.01233188  1.0255059 -0.41607572
##                 age        dis        rad        tax     ptratio
## low      -0.9654355  1.0074428 -0.6953345 -0.7215075 -0.43483412
## med_low  -0.3509550  0.3719930 -0.5588134 -0.5432244 -0.06665715
## med_high  0.4248400 -0.3696329 -0.4246941 -0.3088871 -0.24517066
## high      0.7828194 -0.8404207  1.6390172  1.5146914  0.78181164
##                black       lstat        medv
## low       0.37054878 -0.75571378  0.47561323
## med_low   0.34598532 -0.14093571  0.01107822
## med_high  0.08102381  0.03345678  0.12841597
## high     -0.71921086  0.83753641 -0.63614925
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.09443167  0.73684375 -0.95754464
## indus    0.09199207 -0.14025300  0.27516049
## chas    -0.09335274 -0.09623101  0.01975018
## nox      0.31020209 -0.52735211 -1.42442757
## rm      -0.14357802 -0.10349530 -0.18943586
## age      0.21200407 -0.42878099 -0.13105976
## dis     -0.04508493 -0.14552812  0.09491612
## rad      3.32313031  1.19234534  0.34723668
## tax      0.11352496 -0.34230065  0.15762337
## ptratio  0.11267842  0.03697259 -0.33956654
## black   -0.11622480  0.03142054  0.18077624
## lstat    0.22936706 -0.19385382  0.36223286
## medv     0.23634775 -0.29574077 -0.16814036
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9486 0.0392 0.0122
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit)

6

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      15        2    0
##   med_low    8      11        9    0
##   med_high   0       5       18    2
##   high       0       0        0   21

The proportion of correct predictions is 11/28 (39%) for the low crime rate, 11/28 (39%) for med_low, 18/25 (72%) for med_high and 21/21 (100%) for high. Overall, the predictions are reasonably good (61/102, or ~60%, correct), and the higher the crime rate, the better the predictions.
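The overall accuracy can be read straight off the confusion matrix (a sketch using the objects defined above; the diagonal holds the correctly classified observations because the factor levels align):

```r
# Overall accuracy from the confusion matrix
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)
```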

7

#Reload and scale the Boston data
data("Boston")
boston_scaled <- scale(Boston)

set.seed(123)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
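`dist()` also supports other metrics; for comparison, a Manhattan distance matrix over the same scaled data can be computed in the same way (a sketch using the `boston_scaled` object from above).

```r
# Manhattan (city-block) distances for comparison with the Euclidean ones
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```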
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
#visualize the clusters on the scaled data used for clustering
pairs(boston_scaled, col = km$cluster)
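A pairs plot over all 14 variables is crowded; restricting it to a handful of columns makes the cluster structure easier to see. The column choice below is just an illustration, assuming `boston_scaled` and `km` from the chunk above.

```r
# a smaller pairs plot is easier to read; pick a few columns of interest
pairs(boston_scaled[, c("crim", "rad", "tax", "nox", "dis")], col = km$cluster)
```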

##investigate the optimal number of clusters
#set seed for reproducibility
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

#2 seems to be an optimal number of clusters, because the total within sum of squares drops dramatically when moving from one cluster to two clusters.
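#The elbow can also be located programmatically by finding the largest drop in total WCSS; a sketch, assuming twcss from the chunk above.

```r
# the biggest drop in total within-cluster sum of squares marks the elbow
drops <- -diff(twcss)
which.max(drops) + 1  # number of clusters just after the largest drop
```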

#run the algorithm again with two centres
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Two is the optimal number of clusters. Two clusters seem to differentiate well between towns with low and high crime rates, while the other variables do not differentiate clearly between the clusters.

Bonus

#Reload and scale the Boston data
data("Boston")
boston_scaled <- scale(Boston)
#change the data to a data frame
boston_scaled <- as.data.frame(boston_scaled)

# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

#add the clusters to the data
boston_scaled <- data.frame(boston_scaled, km$cluster)
glimpse(boston_scaled)
## Observations: 506
## Variables: 15
## $ crim       <dbl> -0.4193669, -0.4169267, -0.4169290, -0.4163384, -0....
## $ zn         <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, ...
## $ indus      <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1....
## $ chas       <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0....
## $ nox        <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0....
## $ rm         <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.22736...
## $ age        <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, ...
## $ dis        <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711...
## $ rad        <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0....
## $ tax        <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1....
## $ ptratio    <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.11...
## $ black      <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.44061...
## $ lstat      <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078,...
## $ medv       <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1....
## $ km.cluster <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
#perform LDA using the clusters as target classes
lda.fit <- lda(km.cluster ~ ., data = boston_scaled)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# plot the lda results, colouring by the k-means cluster (not the old crime classes)
plot(lda.fit, dimen = 2, col = boston_scaled$km.cluster, pch = boston_scaled$km.cluster)
lda.arrows(lda.fit)

Rad (index of accessibility to radial highways) is the most influential linear separator for the clusters. Other influential linear separators are tax (full-value property-tax rate per $10,000), age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres).
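This reading of the biplot arrows can be checked numerically by ranking the discriminant coefficients; a sketch, assuming `lda.fit` is the cluster-LDA model fitted above.

```r
# rank the variables by the magnitude of their LD1 coefficient
sort(abs(lda.fit$scaling[, "LD1"]), decreasing = TRUE)
```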

Super bonus

lda.fit <- lda(crime ~ ., data = train)

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

#install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
train2 <- train
train2$num_crime <- as.numeric(train2$crime)
train2$num_crime <- scale(train2$num_crime)
train2$crime <- train2$num_crime
train2 <- dplyr::select(train2, -num_crime)
km <- kmeans(train2, centers = 2)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)

There is an obvious difference: crime has four classes but k-means produces only two clusters. Apart from that, the plots look surprisingly similar. The clusters formed by k-means seem mostly to differentiate the towns with a high crime rate from those with lower crime rates.
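The visual impression can be checked with a cross-tabulation of the two k-means clusters against the original four crime classes; a sketch, assuming `train` and `km` from the chunks above.

```r
# how the two k-means clusters line up with the four crime classes
table(crime = train$crime, cluster = km$cluster)
```

If one cluster is dominated by the high class while the other collects the remaining three classes, that supports the interpretation above.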